ℹ️ R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the early 1990s. The development of R was motivated by the desire for an open-source statistical computing and graphics environment that was accessible to researchers and students. The name “R” was partly derived from the first letters of the creators’ first names, Ross and Robert, and also as a play on the name of the S programming language, from which R was conceived as an implementation.
Ross Ihaka, a statistician, and Robert Gentleman, who also has a background in statistics, were both faculty members at the University of Auckland when they began working on R. Their collaboration aimed to create a software environment for their students that was not only free but also flexible enough to perform a wide range of statistical analyses and graphical representations.
The initial release of R to the public occurred in 1995, and it quickly gained popularity in the statistical computing community due to its open-source nature, allowing users to modify, improve, and distribute the software. The Comprehensive R Archive Network (CRAN) was established as a repository of R and its packages, facilitating the contribution and sharing of code by users worldwide.
Strengths
Weaknesses
Install
## [1] 20
## [1] 10
ℹ️ The use of <- as the assignment operator in R is
rooted in its history and the influence of the S language.
While <- is the traditional and recommended assignment
operator in R, = is also supported for assignment.
However, the use of <- is encouraged in the R community
and in coding standards for clarity and tradition.
## [1] 110
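The three outputs above (20, 10, 110) could come from an assignment sketch like the following (the variable name and expressions are assumptions; the original chunks are not shown):

```r
x <- 10   # assignment with the <- operator
x * 2     # 20
x         # 10
x + 100   # 110
```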
Create a project with File/New Project.
R scripts use the *.R extension; execute code with Run or Ctrl + Enter.
Create a notebook with File/New File/R Notebook; code is written in chunks (inserted with Ctrl + Alt + I). Use Ctrl + Enter to execute the current line of code, or Ctrl + Shift + Enter to execute the whole chunk.
Install packages with install.packages() or via the Packages tab and Install button. Load a package with library(package_name); library(xxx) calls should be placed at the top of the script/notebook.
💻 Practice
tidyverse and questionr
ℹ️ The questionr package in R is a useful tool designed primarily for social scientists, providing a set of functions to facilitate the analysis of survey data. It was developed to simplify common tasks associated with survey data analysis, including data manipulation, summarization, and visualization, as well as to provide easy access to descriptive statistics that are particularly relevant in social science research.
typeof(object) to display the type of an object
## [1] "character"
Vectors are created with the c() function, where c() means concatenate or combine.
## [1] 1 10 15
## [1] "hello" "world"
💡 Wrapping an assignment in parentheses in a chunk executes the line of code and displays the output
Mathematical operations are applied element-wise.
## [1] 175 154 185 192
## [1] 23.51020 24.87772 32.14025 25.77040
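The two outputs above are consistent with a height/weight example where a BMI is computed element-wise (the weight values below are reconstructed assumptions that reproduce the printed numbers):

```r
height <- c(175, 154, 185, 192)   # heights in cm
weight <- c(72, 59, 110, 95)      # assumed weights in kg
height                            # 175 154 185 192
bmi <- weight / (height / 100)^2  # element-wise: one BMI per person
bmi                               # 23.51020 24.87772 32.14025 25.77040
```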
Other functions to create a vector
## [1] 0 0 0 0 0 0 0 0 0 0
: to create a sequence of numbers that follow each other
by an interval of 1
## [1] 1 2 3 4 5 6 7 8 9 10
seq() to create a sequence of numbers with custom start,
end and step values
## [1] 0 4 8 12 16 20
rep() to repeat a vector
## [1] 1 2 3 4 5 1 2 3 4 5
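The four outputs above can be reproduced with calls like these (a sketch; the exact arguments of the original chunks are assumed):

```r
rep(0, 10)          # ten zeros
1:10                # integers 1 through 10, step 1
seq(0, 20, by = 4)  # 0 4 8 12 16 20
rep(1:5, 2)         # 1 2 3 4 5 repeated twice
```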
💡 Notes:
seq() and rep() have other arguments; see the help with ?seq and ?rep.
rnorm, runif, rbinom, rpois, etc. draw random values from the corresponding distributions.
Indexation system
Elements are accessed with square brackets [].
## [1] 1.75
ℹ️ The decision to start indexing from 1 in R:
Ease for Non-programmers: Many users of R come from academic backgrounds in statistics, mathematics, and various fields of science, where 1-based indexing is the norm. Starting indexes at 1 can be more intuitive for individuals who are not primarily trained in computer science but are using R for data analysis, statistical modelling, and research.
Alignment with Theoretical Concepts: In statistical formulas and mathematical expressions, indexes typically start at 1. Using 1-based indexing in R makes it easier to translate these formulas and expressions directly into code without having to adjust the index values.
to select multiple elements, provide a vector of indices inside the brackets
## [1] 1.75 1.85
slicing can be used to select a range of elements
## [1] 1.54 1.85 1.92
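The indexing outputs above match a sketch like this (the vector name sizes is an assumption):

```r
sizes <- c(1.75, 1.54, 1.85, 1.92)  # heights in metres
sizes[1]        # a single element: 1.75
sizes[c(1, 3)]  # multiple elements: 1.75 1.85
sizes[2:4]      # slicing a range: 1.54 1.85 1.92
```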
## [1] "M" "F" "M" "M" "F" "F"
## [1] "character"
## [1] M F M M F F
## Levels: F M
## [1] "integer"
## Factor w/ 2 levels "F","M": 2 1 2 2 1 1
possible to change the names of the levels with the levels function
## Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1
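The factor outputs above correspond to a sketch like this (object names are assumptions):

```r
sexe <- c("M", "F", "M", "M", "F", "F")
typeof(sexe)            # "character"
sexe_f <- factor(sexe)  # levels are sorted: F, M
typeof(sexe_f)          # "integer" - factors are stored as integer codes
str(sexe_f)             # Factor w/ 2 levels "F","M": 2 1 2 2 1 1
levels(sexe_f) <- c("Female", "Male")  # rename the levels
str(sexe_f)             # Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1
```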
A function is called by its name, followed by parentheses containing zero, one, or multiple arguments.
?function_name (or help("function_name")) to display the function’s help documentation.
length(v): returns the length of v (number of elements).
mean(v): returns the mean of v.
min(v), max(v): return the minimum (maximum) value of v.
sum(v): returns the sum of the values in v.
range(v): returns a vector containing the minimum and maximum values.
unique(v): returns a vector derived from v with duplicate values removed.
## [1] 1.765
## [1] 1.54
With rbind and cbind
rbind to combine vectors by row
cbind to combine vectors by column
## [1] 1 2 3
## [1] 4 5 6
rbind
## [,1] [,2] [,3]
## v1 1 2 3
## v2 4 5 6
cbind
## v1 v2
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
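The two outputs above come from combining the same two small vectors:

```r
v1 <- 1:3
v2 <- 4:6
rbind(v1, v2)  # 2 x 3 matrix: vectors stacked as rows
cbind(v1, v2)  # 3 x 2 matrix: vectors placed side by side as columns
```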
With rowMeans and colMeans
rowMeans to calculate the mean by row
colMeans to calculate the mean by column
## v1 v2
## [1,] 1 2
## [2,] 3 6
## [3,] 5 10
## [4,] 7 14
## [5,] 9 18
## [1] 1.5 4.5 7.5 10.5 13.5
## v1 v2
## 5 10
💡 There are also rowSums and colSums for calculating sums.
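The means above can be reproduced with a sketch like this (the matrix values are taken from the printed output):

```r
m <- cbind(v1 = c(1, 3, 5, 7, 9), v2 = c(2, 6, 10, 14, 18))
rowMeans(m)  # 1.5 4.5 7.5 10.5 13.5
colMeans(m)  # v1 = 5, v2 = 10
```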
💻 Practice
Exercise 1. With the vectors below, which represent three households:
(hint: abs)
Exercise 2. With the vector below:
education_levels <- c("Bachelor", "Master", "Ph.D")
(education <- sample(education_levels, 10, replace = TRUE))
## [1] "Ph.D" "Master" "Bachelor" "Bachelor" "Ph.D" "Master"
## [7] "Master" "Master" "Ph.D" "Bachelor"
💡 sample() takes a sample of the specified size from the elements of x, either with or without replacement.
Exercise 3. With the vectors below, which represent the grades of 6 students in 3 disciplines:
df1 <- data.frame(
gender=factor(sample(c("M", "F"), 10, replace=T)),
age=sample(18:100, 10, replace=T),
education=education
)
df1
## gender age education
## 1 F 53 Ph.D
## 2 F 55 Master
## 3 M 79 Bachelor
## 4 M 80 Bachelor
## 5 F 70 Ph.D
## 6 M 80 Master
## 7 M 50 Master
## 8 M 99 Master
## 9 F 43 Ph.D
## 10 M 66 Bachelor
We can access a variable/column/vector in a dataframe with $
## [1] F F M M F M M M F M
## Levels: F M
## [1] F M
## Levels: F M
Let’s take an example dataset, hdv2003, provided by the
questionr package that was previously loaded. We load the
dataset with the following command:
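The loading command itself is not shown above; with questionr installed, it is the standard data() call:

```r
library(questionr)
data("hdv2003")  # places the hdv2003 data frame in the working environment
```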
This places the dataset into the working environment.
hdv2003 is a subset of the Histoire de vie survey
conducted by INSEE in 2003. It contains 2,000 individuals and 20
variables.
Click on the object in the Environment tab to view the data table (you can also use the function View(hdv2003)).
💡 data() written in the console displays all available
datasets (R core + loaded libraries).
Useful functions to discover a dataframe
names(df): returns the variable (column) names
nrow(df): returns the number of rows
ncol(df): returns the number of columns
dim(df): returns a vector with the dataset dimensions, i.e., nrow and ncol
str(df): displays information about the dataset – each variable with its type and the first few values (clicking the blue arrow in the Environment tab shows the content of str(df))
head(df): displays the first rows
tail(df): displays the last rows
## [1] 2000 20
## [1] "id" "age" "sexe" "nivetud"
## [5] "poids" "occup" "qualif" "freres.soeurs"
## [9] "clso" "relig" "trav.imp" "trav.satisf"
## [13] "hard.rock" "lecture.bd" "peche.chasse" "cuisine"
## [17] "bricol" "cinema" "sport" "heures.tv"
💻 Practice
Load the dataset hdv2003 from package
questionr
Creation:
matrix(data, nrow, ncol, byrow=FALSE, dimnames=NULL)
diag(n): diagonal matrix with n ones on the diagonal
diag(v): diagonal matrix with the vector v on the diagonal
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
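The 3 × 3 matrix above can be built with matrix(), filling by row:

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m  # first row 1 2 3, second row 4 5 6, third row 7 8 9
```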
Matrix operations
%*%: matrix product
t(mat): matrix transpose
solve(mat): matrix inverse
det(mat): matrix determinant
ℹ️ More info:
array() function to create arrays.
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
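The two-layer array printed above corresponds to:

```r
a <- array(1:18, dim = c(3, 3, 2))  # 3 rows, 3 columns, 2 layers, filled column-wise
a
```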
Lists are created with the list() function, and elements are separated by commas.
## $my_vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $my_sentence
## [1] "Hello world!"
an element can be accessed by its name with $
## [1] 1 2 3 4 5 6 7 8 9 10
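The list above can be created and accessed like this:

```r
my_list <- list(my_vector = 1:10, my_sentence = "Hello world!")
my_list              # prints both named elements
my_list$my_vector    # access an element by name
```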
The tidyverse
Set of libraries aimed at facilitating the management of the dataset (data cleaning, data wrangling, preprocessing)
library(tidyverse)
readr (import of data), tibble (new dataframe), forcats (qualitative variables), stringr (for characters), tidyr (data formatting), dplyr (data manipulation), ggplot2 (visualization), purrr (programming)
## # A tibble: 2,000 × 20
## id age sexe nivetud poids occup qualif freres.soeurs clso relig
## <int> <int> <fct> <fct> <dbl> <fct> <fct> <int> <fct> <fct>
## 1 1 28 Femme Enseignement… 2634. Exer… Emplo… 8 Oui Ni c…
## 2 2 23 Femme <NA> 9738. Etud… <NA> 2 Oui Ni c…
## 3 3 59 Homme Derniere ann… 3994. Exer… Techn… 2 Non Ni c…
## 4 4 34 Homme Enseignement… 5732. Exer… Techn… 1 Non Appa…
## 5 5 71 Femme Derniere ann… 4329. Retr… Emplo… 0 Oui Prat…
## 6 6 35 Femme Enseignement… 8675. Exer… Emplo… 5 Non Ni c…
## 7 7 60 Femme Derniere ann… 6166. Au f… Ouvri… 1 Oui Appa…
## 8 8 47 Homme Enseignement… 12892. Exer… Ouvri… 5 Non Ni c…
## 9 9 20 Femme <NA> 7809. Etud… <NA> 4 Oui Appa…
## 10 10 28 Homme Enseignement… 2277. Exer… Autre 2 Non Prat…
## # ℹ 1,990 more rows
## # ℹ 10 more variables: trav.imp <fct>, trav.satisf <fct>, hard.rock <fct>,
## # lecture.bd <fct>, peche.chasse <fct>, cuisine <fct>, bricol <fct>,
## # cinema <fct>, sport <fct>, heures.tv <dbl>
💡 Notes:
read_excel, read_stata, read_sas, etc. to import data from other formats.
To export a dataset to a csv file, there are write_csv and write_csv2.
Serialization
Objects can be saved to a file (*.Rdata) with save(objects, file="file.Rdata"); load("file.Rdata") reloads the objects into the R session environment.
With forcats {#forcats}
fct_recode to rename the levels
## n % val%
## Ouvrier specialise 203 10.2 12.3
## Ouvrier qualifie 292 14.6 17.7
## Technicien 86 4.3 5.2
## Profession intermediaire 160 8.0 9.7
## Cadre 260 13.0 15.7
## Employe 594 29.7 35.9
## Autre 58 2.9 3.5
## NA 347 17.3 NA
💡 The freq function from the questionr
package provides information on a categorical variable: counts,
percentages (with and without NAs).
(?questionr::freq).
💡 The categories of a categorical variable are called
levels.
## [1] "Ouvrier specialise" "Ouvrier qualifie"
## [3] "Technicien" "Profession intermediaire"
## [5] "Cadre" "Employe"
## [7] "Autre"
fct_recode(vector, "new name" = "old name", ...)
hdv2003$qualif_grouped <- fct_recode(
hdv2003$qualif,
"Ouvrier_spe" = "Ouvrier specialise",
"Ouvrier_qual" = "Ouvrier qualifie",
"Interm" = "Profession intermediaire"
)
freq(hdv2003$qualif_grouped)
## n % val%
## Ouvrier_spe 203 10.2 12.3
## Ouvrier_qual 292 14.6 17.7
## Technicien 86 4.3 5.2
## Interm 160 8.0 9.7
## Cadre 260 13.0 15.7
## Employe 594 29.7 35.9
## Autre 58 2.9 3.5
## NA 347 17.3 NA
To recode a factor level as missing (NA), set NULL as the new name
hdv2003$qualif_grouped <- fct_recode(hdv2003$qualif_grouped, NULL="Autre")
freq(hdv2003$qualif_grouped)
## n % val%
## Ouvrier_spe 203 10.2 12.7
## Ouvrier_qual 292 14.6 18.3
## Technicien 86 4.3 5.4
## Interm 160 8.0 10.0
## Cadre 260 13.0 16.3
## Employe 594 29.7 37.2
## NA 405 20.2 NA
To code the NA values as an explicit factor level, use fct_na_value_to_level
## [1] "Ouvrier_spe" "Ouvrier_qual" "Technicien" "Interm" "Cadre"
## [6] "Employe"
hdv2003$qualif_grouped <- fct_na_value_to_level(hdv2003$qualif_grouped)
levels(hdv2003$qualif_grouped)
## [1] "Ouvrier_spe" "Ouvrier_qual" "Technicien" "Interm" "Cadre"
## [6] "Employe" NA
fct_collapse to group several levels into one
hdv2003$qualif_rec <- fct_collapse(
hdv2003$qualif,
"Ouvrier"=c("Ouvrier specialise", "Ouvrier qualifie"),
"Interm"=c("Technicien", "Profession intermediaire"))
freq(hdv2003$qualif_rec)
## n % val%
## Ouvrier 495 24.8 29.9
## Interm 246 12.3 14.9
## Cadre 260 13.0 15.7
## Employe 594 29.7 35.9
## Autre 58 2.9 3.5
## NA 347 17.3 NA
💡 Notes:
fct_other() to group a set of categories into the “Other” category.
fct_lump() to group the least frequent categories into the “Other” category.
questionr provides a graphical recoding interface, accessible via Addins → Levels Recoding.
fct_relevel to reorder the levels
hdv2003$qualif_rec <- fct_relevel(
hdv2003$qualif,
"Cadre", "Profession intermediaire", "Technicien",
"Employe", "Ouvrier qualifie", "Ouvrier specialise",
"Autre"
)
freq(hdv2003$qualif_rec)
## n % val%
## Cadre 260 13.0 15.7
## Profession intermediaire 160 8.0 9.7
## Technicien 86 4.3 5.2
## Employe 594 29.7 35.9
## Ouvrier qualifie 292 14.6 17.7
## Ouvrier specialise 203 10.2 12.3
## Autre 58 2.9 3.5
## NA 347 17.3 NA
## Factor w/ 7 levels "Cadre","Profession intermediaire",..: 4 NA 3 3 4 4 5 5 NA 7 ...
💻 Practice
n % val%
Pratiquant 708 35.4 35.4
Appartenance 760 38.0 38.0
Ni croyance ni appartenance 399 20.0 20.0
Rejet 93 4.7 4.7
NSP 40 2.0 2.0
n % val%
N'a jamais fait d'etudes 39 2.0 2.1
Études primaires 427 21.3 22.6
1er cycle 204 10.2 10.8
2eme cycle 183 9.2 9.7
Enseignement technique ou professionnel 594 29.7 31.5
Enseignement superieur 441 22.0 23.4
NA 112 5.6 NA
n % val%
Enseignement superieur 441 22.0 23.4
Enseignement technique ou professionnel 594 29.7 31.5
2eme cycle 183 9.2 9.7
1er cycle 204 10.2 10.8
Études primaires 427 21.3 22.6
N'a jamais fait d'etudes 39 2.0 2.1
NA 112 5.6 NA
From a numeric variable
With the cut function
The breaks argument allows defining class intervals.
breaks=number: the interpreter will create number classes of equal width
## n % val%
## (17.9,33.8] 454 22.7 22.7
## (33.8,49.6] 628 31.4 31.4
## (49.6,65.4] 556 27.8 27.8
## (65.4,81.2] 319 16.0 16.0
## (81.2,97.1] 43 2.1 2.1
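The chunk producing the five classes above is not shown; it is presumably a call of the form cut(hdv2003$age, breaks = 5). As a self-contained illustration of breaks=number:

```r
x <- c(18, 25, 40, 60, 97)
table(cut(x, breaks = 2))  # two classes of equal width over the range of x
```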
breaks=vector: the interpreter will create classes
according to the values in the vector
hdv2003$age_classe2 <- cut(
hdv2003$age,
breaks = c(18, 25, 35, 45, 55, 65, 100),
include.lowest = T
)
freq(hdv2003$age_classe2)
## n % val%
## [18,25] 191 9.6 9.6
## (25,35] 338 16.9 16.9
## (35,45] 390 19.5 19.5
## (45,55] 414 20.7 20.7
## (55,65] 305 15.2 15.2
## (65,100] 362 18.1 18.1
💡 package questionr provides a graphical interface for cut: Addins/Numeric range dividing
if_else
## [1] "greater than 10" "greater than 10"
## [3] "lower than (or equal to) 10" "greater than 10"
## [5] "lower than (or equal to) 10"
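The if_else output above is consistent with a sketch like this (the input vector is an assumption; if_else comes from dplyr):

```r
library(dplyr)
x <- c(12, 15, 8, 11, 9)
if_else(x > 10, "greater than 10", "lower than (or equal to) 10")
```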
the test can be built with several conditions
hdv2003$statut <- if_else(
hdv2003$sexe == "Homme" & hdv2003$age > 60,
"Men over 60",
"Other"
)
freq(hdv2003$statut)
## n % val%
## Men over 60 222 11.1 11.1
## Other 1778 88.9 88.9
case_when
an extension of if_else when several conditions are
needed.
It is written as condition ~ value
⚠️ The conditions are tested sequentially, so it is necessary to start from the most specific condition.
hdv2003$statut <- case_when(
hdv2003$age > 60 & hdv2003$sexe == "Homme" ~ "Man over 60",
hdv2003$age > 50 & hdv2003$sexe == "Femme" ~ "Woman over 50",
TRUE ~ "Other"
)
freq(hdv2003$statut)
## n % val%
## Man over 60 222 11.1 11.1
## Other 1312 65.6 65.6
## Woman over 50 466 23.3 23.3
As TRUE is always TRUE, all the other values will be coded as “Other”.
💻 Practice
n % val%
[0,1] 684 34.2 34.3
(1,2] 535 26.8 26.8
(2,4] 594 29.7 29.8
(4,6] 138 6.9 6.9
(6,12] 44 2.2 2.2
NA 5 0.2 NA
n % val%
Autre 1971 98.6 98.6
Cinéma et BD 29 1.5 1.5
n % val%
Other 1001 50.0 50.0
Woman with more than 2 B&S 546 27.3 27.3
Man with more than 2 B&S 453 22.7 22.7
To manipulate the data, it has to be well organized, “tidy”.
Example of a non-tidy dataset: population of 3 countries over 4 years.
For this dataset to be “tidy”, each variable must be in its own column and each observation in its own row.
The dataset should be of the form
pivot_longer
Function pivot_longer to create a longer dataset, i.e. with more rows
## country 2002 2007
## 1 Belgium 10311970 10392226
## 2 France 59925035 61083916
## 3 Germany 82350671 82400996
the columns 2002 and 2007 should be gathered into a column “annee” (year), with the values in a column “population”
## # A tibble: 6 × 3
## country annee population
## <chr> <chr> <dbl>
## 1 Belgium 2002 10311970
## 2 Belgium 2007 10392226
## 3 France 2002 59925035
## 4 France 2007 61083916
## 5 Germany 2002 82350671
## 6 Germany 2007 82400996
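A sketch of the pivot_longer call (column names taken from the printed outputs; the input data frame is rebuilt from the wide table above):

```r
library(tidyr)
pop <- data.frame(
  country = c("Belgium", "France", "Germany"),
  "2002" = c(10311970, 59925035, 82350671),
  "2007" = c(10392226, 61083916, 82400996),
  check.names = FALSE
)
pivot_longer(pop, cols = c("2002", "2007"),
             names_to = "annee", values_to = "population")
```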
pivot_wider
Function pivot_wider to create a wider dataset, i.e. with more columns
## country continent year variable value
## 1 Belgium Europe 2002 lifeExp 78.320
## 2 Belgium Europe 2002 pop 10311970.000
## 3 Belgium Europe 2007 lifeExp 79.441
## 4 Belgium Europe 2007 pop 10392226.000
## 5 France Europe 2002 lifeExp 79.590
## 6 France Europe 2002 pop 59925035.000
## 7 France Europe 2007 lifeExp 80.657
## 8 France Europe 2007 pop 61083916.000
the column “variable” should be divided into two columns, lifeExp and pop
## # A tibble: 4 × 5
## country continent year lifeExp pop
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Belgium Europe 2002 78.3 10311970
## 2 Belgium Europe 2007 79.4 10392226
## 3 France Europe 2002 79.6 59925035
## 4 France Europe 2007 80.7 61083916
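A sketch of the corresponding pivot_wider call (the long input is rebuilt here from part of the table above):

```r
library(tidyr)
df_long <- data.frame(
  country = "Belgium", continent = "Europe",
  year = c(2002, 2002, 2007, 2007),
  variable = c("lifeExp", "pop", "lifeExp", "pop"),
  value = c(78.320, 10311970, 79.441, 10392226)
)
pivot_wider(df_long, names_from = variable, values_from = value)
```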
💡 Notes:
separate and unite functions to separate and unite columns
separate_rows to separate a column into several rows
complete function to complete missing combinations
dplyr provides functions that facilitate data manipulation
For the examples, we use the datasets from the package
nycflights23.
This is the flight data of the 3 New York airports in 2023.
It is necessary to install the package nycflights23 and
then to load the three datasets
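A sketch of the setup (the dataset names beyond flights and airports, which appear in the examples below, are assumptions):

```r
# install.packages("nycflights23")  # once
library(nycflights23)
data(flights)
data(airports)
data(airlines)  # assumed third dataset
```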
To select rows based on their index in the dataset
## # A tibble: 7 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 AAP Andrau Airpark 29.7 -95.6 79 -6 A Amer…
## 2 ABE Lehigh Valley International Airport 40.7 -75.4 393 -5 A Amer…
## 3 ABI Abilene Regional Airport 32.4 -99.7 1791 -6 A Amer…
## 4 ABL Ambler Airport 67.1 -158. 334 -9 A Amer…
## 5 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Amer…
## 6 ABR Aberdeen Regional Airport 45.4 -98.4 1302 -6 A Amer…
## 7 ABY Southwest Georgia Regional Airport 31.5 -84.2 197 -5 A Amer…
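The seven airports printed above are rows 2 to 8 of airports, which suggests a slice call like:

```r
library(dplyr)
library(nycflights23)
slice(airports, 2:8)  # select rows 2 through 8 by index
```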
To select rows in the dataset based on conditions.
Example: all the flights of january
## # A tibble: 36,020 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 1 1 1 2038 203 328 3
## 2 2023 1 1 18 2300 78 228 135
## 3 2023 1 1 31 2344 47 500 426
## 4 2023 1 1 33 2140 173 238 2352
## 5 2023 1 1 36 2048 228 223 2252
## 6 2023 1 1 503 500 3 808 815
## 7 2023 1 1 520 510 10 948 949
## 8 2023 1 1 524 530 -6 645 710
## 9 2023 1 1 537 520 17 926 818
## 10 2023 1 1 547 545 2 845 852
## # ℹ 36,010 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
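The January subset above is obtained with a filter on month:

```r
library(dplyr)
library(nycflights23)
filter(flights, month == 1)  # all the flights of January
```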
Several conditions
Example: all the flights with departure delay between 10 and 15 minutes
## # A tibble: 17,829 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 1 1 520 510 10 948 949
## 2 2023 1 1 613 600 13 834 836
## 3 2023 1 1 810 800 10 1153 1147
## 4 2023 1 1 811 800 11 1122 1136
## 5 2023 1 1 814 800 14 1048 1058
## 6 2023 1 1 833 820 13 1436 1435
## 7 2023 1 1 840 828 12 1136 1140
## 8 2023 1 1 858 845 13 1042 1045
## 9 2023 1 1 859 844 15 1207 1209
## 10 2023 1 1 1005 950 15 1226 1235
## # ℹ 17,819 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
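The output above combines two conditions on dep_delay:

```r
library(dplyr)
library(nycflights23)
filter(flights, dep_delay >= 10 & dep_delay <= 15)
```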
💡 filter can be used with & (AND) and
| (OR) operators.
Possible to supply the result of a function as a condition
Example: flights with the longest distance
## # A tibble: 667 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 1 1 949 900 49 NA 1525
## 2 2023 1 1 1023 1000 23 1637 1610
## 3 2023 1 2 919 900 19 1543 1525
## 4 2023 1 2 951 1000 -9 1620 1610
## 5 2023 1 3 922 900 22 1535 1525
## 6 2023 1 3 1007 1000 7 1630 1610
## 7 2023 1 4 912 900 12 1511 1525
## 8 2023 1 4 1001 1000 1 1630 1610
## 9 2023 1 5 854 900 -6 1454 1525
## 10 2023 1 5 949 1000 -11 1600 1610
## # ℹ 657 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
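The longest-distance flights above come from supplying max() inside the condition (the na.rm argument is an assumption):

```r
library(dplyr)
library(nycflights23)
filter(flights, distance == max(distance, na.rm = TRUE))
```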
select some columns in the dataset
Example: latitude and longitude
## # A tibble: 1,251 × 2
## lat lon
## <dbl> <dbl>
## 1 29.7 -85.0
## 2 29.7 -95.6
## 3 40.7 -75.4
## 4 32.4 -99.7
## 5 67.1 -158.
## 6 35.0 -107.
## 7 45.4 -98.4
## 8 31.5 -84.2
## 9 41.3 -70.1
## 10 31.6 -97.2
## # ℹ 1,241 more rows
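The two-column result above is a straightforward select:

```r
library(dplyr)
library(nycflights23)
select(airports, lat, lon)  # keep only latitude and longitude
```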
if we add “-” before the name of the columns, it displays all the columns except the ones listed
## # A tibble: 1,251 × 7
## faa name lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 AAF Apalachicola Regional Airport -85.0 20 -5 A America/N…
## 2 AAP Andrau Airpark -95.6 79 -6 A America/C…
## 3 ABE Lehigh Valley International Airport -75.4 393 -5 A America/N…
## 4 ABI Abilene Regional Airport -99.7 1791 -6 A America/C…
## 5 ABL Ambler Airport -158. 334 -9 A America/A…
## 6 ABQ Albuquerque International Sunport -107. 5355 -7 A America/D…
## 7 ABR Aberdeen Regional Airport -98.4 1302 -6 A America/C…
## 8 ABY Southwest Georgia Regional Airport -84.2 197 -5 A America/N…
## 9 ACK Nantucket Memorial Airport -70.1 47 -5 A America/N…
## 10 ACT Waco Regional Airport -97.2 516 -6 A America/C…
## # ℹ 1,241 more rows
possible to use selection helpers as arguments to give conditions on the columns returned (here, starts_with)
## # A tibble: 435,352 × 2
## dep_time dep_delay
## <int> <dbl>
## 1 1 203
## 2 18 78
## 3 31 47
## 4 33 173
## 5 36 228
## 6 503 3
## 7 520 10
## 8 524 -6
## 9 537 17
## 10 547 2
## # ℹ 435,342 more rows
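The dep_time/dep_delay columns above match a starts_with selection:

```r
library(dplyr)
library(nycflights23)
select(flights, starts_with("dep"))  # dep_time and dep_delay
```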
There are also ends_with, contains, and matches.
select can rename the columns ‘on the fly’
## # A tibble: 1,251 × 2
## latitude longitude
## <dbl> <dbl>
## 1 29.7 -85.0
## 2 29.7 -95.6
## 3 40.7 -75.4
## 4 32.4 -99.7
## 5 67.1 -158.
## 6 35.0 -107.
## 7 45.4 -98.4
## 8 31.5 -84.2
## 9 41.3 -70.1
## 10 31.6 -97.2
## # ℹ 1,241 more rows
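The renamed two-column result above can be obtained directly in the select call:

```r
library(dplyr)
library(nycflights23)
select(airports, latitude = lat, longitude = lon)  # rename 'on the fly'
```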
to rename the columns, function rename
## # A tibble: 1,251 × 8
## faa name latitude longitude altitude tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 AAF Apalachicola Regional Ai… 29.7 -85.0 20 -5 A Amer…
## 2 AAP Andrau Airpark 29.7 -95.6 79 -6 A Amer…
## 3 ABE Lehigh Valley Internatio… 40.7 -75.4 393 -5 A Amer…
## 4 ABI Abilene Regional Airport 32.4 -99.7 1791 -6 A Amer…
## 5 ABL Ambler Airport 67.1 -158. 334 -9 A Amer…
## 6 ABQ Albuquerque Internationa… 35.0 -107. 5355 -7 A Amer…
## 7 ABR Aberdeen Regional Airport 45.4 -98.4 1302 -6 A Amer…
## 8 ABY Southwest Georgia Region… 31.5 -84.2 197 -5 A Amer…
## 9 ACK Nantucket Memorial Airpo… 41.3 -70.1 47 -5 A Amer…
## 10 ACT Waco Regional Airport 31.6 -97.2 516 -6 A Amer…
## # ℹ 1,241 more rows
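The output above keeps all columns and only renames three of them:

```r
library(dplyr)
library(nycflights23)
rename(airports, latitude = lat, longitude = lon, altitude = alt)
```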
💡 If the new name (or the old one) contains a space, quotes are necessary.
arrange to sort the dataset according to one or several columns
## # A tibble: 435,352 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 8 16 1839 1929 -50 2254 2218
## 2 2023 12 2 2147 2225 -38 2332 35
## 3 2023 9 3 1801 1834 -33 1934 2040
## 4 2023 8 13 1804 1834 -30 1947 2040
## 5 2023 1 28 1930 1959 -29 2308 2336
## 6 2023 1 18 2103 2129 -26 2301 2336
## 7 2023 5 5 1129 1155 -26 1337 1420
## 8 2023 4 16 925 950 -25 1102 1140
## 9 2023 11 14 2055 2120 -25 2215 2239
## 10 2023 12 6 2234 2259 -25 2350 30
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
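The output above is sorted by increasing departure delay:

```r
library(dplyr)
library(nycflights23)
arrange(flights, dep_delay)  # earliest (most negative) delays first
```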
function desc() to reverse the order
## # A tibble: 435,352 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 12 17 1953 1340 1813 2155 1543
## 2 2023 10 1 1240 659 1781 1407 835
## 3 2023 4 25 1201 659 1742 1315 818
## 4 2023 2 7 2045 1700 1665 2352 2025
## 5 2023 4 20 926 619 1627 1135 822
## 6 2023 10 29 856 600 1616 1050 805
## 7 2023 4 30 1818 1617 1561 2001 1820
## 8 2023 3 17 2027 1830 1557 2346 2139
## 9 2023 3 29 633 525 1508 820 700
## 10 2023 3 19 1154 1201 1433 1321 1335
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
several columns
## # A tibble: 435,352 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 1 28 1930 1959 -29 2308 2336
## 2 2023 1 18 2103 2129 -26 2301 2336
## 3 2023 1 30 2105 2129 -24 2340 2350
## 4 2023 1 25 1844 1905 -21 2200 2222
## 5 2023 1 27 539 600 -21 700 720
## 6 2023 1 19 820 840 -20 928 1007
## 7 2023 1 19 1932 1952 -20 2307 2301
## 8 2023 1 25 715 735 -20 916 950
## 9 2023 1 28 1435 1455 -20 1734 1805
## 10 2023 1 9 816 835 -19 1008 1040
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
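The output above sorts on two columns, month first and then delay within each month:

```r
library(dplyr)
library(nycflights23)
arrange(flights, month, dep_delay)
```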
mutate to create new columns/variables
Creation of a column named air_time_hours
possible to create several columns at once
flights <- mutate(
flights,
distance_km = distance / 0.62137,
speed = distance_km / air_time_hours
)
select(flights, "Air time"=air_time_hours, "Distance (km)"=distance_km, "Speed"=speed)
## # A tibble: 435,352 × 3
## `Air time` `Distance (km)` Speed
## <dbl> <dbl> <dbl>
## 1 6.12 4023. 658.
## 2 1.8 1223. 680.
## 3 3.17 2536. 801.
## 4 1.8 1024. 569.
## 5 1.33 785. 589.
## 6 2.57 1746. 680.
## 7 3.2 2536. 793.
## 8 1.98 1157. 583.
## 9 4.3 2253. 524.
## 10 2.62 1714. 655.
## # ℹ 435,342 more rows
Actions can be chained with the pipe. The pipe operator
is |>.
flights |>
filter(dep_delay >= 10 & dep_delay <= 15) |>
arrange(desc(dep_delay)) |>
select(flight, origin, dest, dep_delay, air_time_hours, distance_km, speed) |>
head(5)
## # A tibble: 5 × 7
## flight origin dest dep_delay air_time_hours distance_km speed
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 786 EWR RSW 15 2.73 1719. 629.
## 2 910 LGA MSY 15 2.92 1904. 653.
## 3 908 EWR MSY 15 2.88 1878. 651.
## 4 503 EWR SFO 15 5.68 4128. 726.
## 5 27 JFK SLC 15 4.97 3203. 645.
⚠️ The actions are chained (pipeline), but nothing is stored. To keep track of the result, start the pipeline with an assignment to an object.
delays <- flights |>
filter(dep_delay >= 10 & dep_delay <= 15) |>
arrange(desc(dep_delay)) |>
select(flight, origin, dest, dep_delay, air_time_hours, distance_km, speed)
The result of the pipeline will be stored in the delays
object
💻 Practice
Display the first flight of each month (combination of group_by and slice)
## # A tibble: 12 × 22
## # Groups: month [12]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2023 1 1 1 2038 203 328 3
## 2 2023 2 1 453 501 -8 934 943
## 3 2023 3 1 6 2259 67 104 5
## 4 2023 4 1 1 2205 116 111 2317
## 5 2023 5 1 10 2225 105 357 227
## 6 2023 6 1 39 1659 460 325 2000
## 7 2023 7 1 2 2256 66 401 300
## 8 2023 8 1 3 2035 208 224 2330
## 9 2023 9 1 12 2236 96 120 2359
## 10 2023 10 1 58 2250 128 302 118
## 11 2023 11 1 2 2340 22 57 54
## 12 2023 12 1 506 509 -3 807 823
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, air_time_hours <dbl>,
## # distance_km <dbl>, speed <dbl>
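The 12-row, one-flight-per-month output above suggests a pipeline like:

```r
library(dplyr)
library(nycflights23)
flights |>
  group_by(month) |>
  slice(1)  # keep the first row of each month group
```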
Group the data by company, calculate the average delay and place the
result in a new variable called mean_delay (combination of
group_by and mutate)
mean_delay is the average delay of each company
flights |>
group_by(carrier) |>
mutate(mean_delay = mean(dep_delay, na.rm=T)) |>
slice(1) |>
select(carrier, mean_delay)
## # A tibble: 14 × 2
## # Groups: carrier [14]
## carrier mean_delay
## <chr> <dbl>
## 1 9E 7.44
## 2 AA 14.2
## 3 AS 12.0
## 4 B6 23.8
## 5 DL 15.1
## 6 F9 35.7
## 7 G4 3.98
## 8 HA 22.9
## 9 MQ 10.5
## 10 NK 18.2
## 11 OO 19.8
## 12 UA 17.6
## 13 WN 16.1
## 14 YX 4.21
💡 See later summarise to get the same information.
Group the data by month and display the flights with the longest
delay (combination of group_by and
filter)
displays the flights with the longest delay for each month
flights |>
group_by(month) |>
filter(dep_delay == max(dep_delay, na.rm=T)) |>
select(month, dep_delay, carrier, origin, dest) |>
arrange(month)
## # A tibble: 12 × 5
## # Groups: month [12]
## month dep_delay carrier origin dest
## <int> <dbl> <chr> <chr> <chr>
## 1 1 1201 AA JFK STT
## 2 2 1665 AA LGA DFW
## 3 3 1557 AA EWR DFW
## 4 4 1742 AA LGA DCA
## 5 5 1312 UA LGA ORD
## 6 6 1353 AA LGA DFW
## 7 7 1413 AA LGA DFW
## 8 8 1161 F9 LGA ATL
## 9 9 1369 AA JFK LAX
## 10 10 1781 AA EWR ORD
## 11 11 1074 DL JFK ATL
## 12 12 1813 AA EWR CLT
possible to group by several variables
flights |>
group_by(month, carrier) |>
filter(dep_delay == max(dep_delay, na.rm=T)) |>
select(month, carrier, dep_delay) |>
arrange(month, carrier)
## # A tibble: 165 × 3
## # Groups: month, carrier [165]
## month carrier dep_delay
## <int> <chr> <dbl>
## 1 1 9E 773
## 2 1 AA 1201
## 3 1 AS 266
## 4 1 B6 739
## 5 1 DL 1074
## 6 1 F9 1006
## 7 1 G4 84
## 8 1 HA 167
## 9 1 MQ 140
## 10 1 NK 761
## # ℹ 155 more rows
It allows calculating a summary statistic on a variable.
flights |>
summarise(
mean_dep_delay=mean(dep_delay, na.rm=T),
mean_arr_delay=mean(arr_delay, na.rm=T)
)
## # A tibble: 1 × 2
## mean_dep_delay mean_arr_delay
## <dbl> <dbl>
## 1 13.8 4.34
Combined with group_by, summarise is
extremely powerful.
Display the average delay of each company and arrange the result from the longest delay to the shortest
flights |>
group_by(carrier) |>
summarise(mean_dep_delay=mean(dep_delay, na.rm=T)) |>
arrange(desc(mean_dep_delay))
## # A tibble: 14 × 2
## carrier mean_dep_delay
## <chr> <dbl>
## 1 F9 35.7
## 2 B6 23.8
## 3 HA 22.9
## 4 OO 19.8
## 5 NK 18.2
## 6 UA 17.6
## 7 WN 16.1
## 8 DL 15.1
## 9 AA 14.2
## 10 AS 12.0
## 11 MQ 10.5
## 12 9E 7.44
## 13 YX 4.21
## 14 G4 3.98
n=n()
To get the information about the number of observations used to make the calculations
flights |>
group_by(carrier) |>
summarise(
mean_dep_delay=mean(dep_delay, na.rm=T),
n=n()
) |>
arrange(desc(mean_dep_delay))
## # A tibble: 14 × 3
## carrier mean_dep_delay n
## <chr> <dbl> <int>
## 1 F9 35.7 1286
## 2 B6 23.8 66169
## 3 HA 22.9 366
## 4 OO 19.8 6432
## 5 NK 18.2 15189
## 6 UA 17.6 79641
## 7 WN 16.1 12385
## 8 DL 15.1 61562
## 9 AA 14.2 40525
## 10 AS 12.0 7843
## 11 MQ 10.5 357
## 12 9E 7.44 54141
## 13 YX 4.21 88785
## 14 G4 3.98 671
With two variables in group_by
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 1,274 × 3
## # Groups: month [12]
## month dest n
## <int> <chr> <int>
## 1 3 BOS 2008
## 2 5 BOS 1798
## 3 4 BOS 1768
## 4 2 BOS 1745
## 5 1 BOS 1693
## 6 3 ORD 1654
## 7 3 MCO 1624
## 8 1 ORD 1603
## 9 5 ORD 1603
## 10 12 MCO 1585
## # ℹ 1,264 more rows
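The month/dest counts above come from grouping on two variables before summarising (the arrange step is inferred from the sorted output):

```r
library(dplyr)
library(nycflights23)
flights |>
  group_by(month, dest) |>
  summarise(n = n()) |>
  arrange(desc(n))
```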
💡 If only the counts are needed, the count function can be used
## # A tibble: 1,274 × 3
## month dest n
## <int> <chr> <int>
## 1 3 BOS 2008
## 2 5 BOS 1798
## 3 4 BOS 1768
## 4 2 BOS 1745
## 5 1 BOS 1693
## 6 3 ORD 1654
## 7 3 MCO 1624
## 8 1 ORD 1603
## 9 5 ORD 1603
## 10 12 MCO 1585
## # ℹ 1,264 more rows
Variables calculated by the summarise function can be used to create a new variable with mutate
flights |>
group_by(month, dest) |>
summarise(n=n()) |>
mutate(pourcentage=n/sum(n) * 100) |>
arrange(desc(pourcentage)) |>
head()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: month [6]
## month dest n pourcentage
## <int> <chr> <int> <dbl>
## 1 3 BOS 2008 5.08
## 2 2 BOS 1745 5.02
## 3 12 MCO 1585 4.75
## 4 4 BOS 1768 4.72
## 5 1 BOS 1693 4.70
## 6 5 BOS 1798 4.64
💻 Practice
ℹ️ Descriptive statistics are numerical summaries that describe and summarize the features of a dataset.
We use the hdv2003 dataset, provided by the questionr package
The analysis depends on the type of the variable (quantitative or qualitative).
For a quantitative variable, the summary() function displays the main descriptive statistics (minimum, quartiles, median, mean, maximum) directly:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 35.00 48.00 48.16 60.00 97.00
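The summary above can be reproduced along these lines, assuming hdv2003 from the questionr package:

```r
library(questionr)
data(hdv2003)

# Minimum, quartiles, median, mean and maximum of a quantitative variable
summary(hdv2003$age)
```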
The table() function counts the number of observations in each category/factor; the result is called a contingency table.
Use useNA = "ifany" to also count the NA values.
##
## Ouvrier specialise Ouvrier qualifie Technicien
## 203 292 86
## Profession intermediaire Cadre Employe
## 160 260 594
## Autre <NA>
## 58 347
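A sketch of the call producing the table above, assuming the qualif variable of hdv2003:

```r
library(questionr)
data(hdv2003)

# One count per category; useNA = "ifany" adds a column for missing values
table(hdv2003$qualif, useNA = "ifany")
```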
If percentages are preferred over counts, use the prop.table() function with the result of table() as argument
##
## Ouvrier specialise Ouvrier qualifie Technicien
## 10.15 14.60 4.30
## Profession intermediaire Cadre Employe
## 8.00 13.00 29.70
## Autre <NA>
## 2.90 17.35
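A sketch of the corresponding call; multiplying by 100 turns proportions into percentages:

```r
library(questionr)
data(hdv2003)

tb <- table(hdv2003$qualif, useNA = "ifany")
# Proportions of the grand total, expressed as percentages
prop.table(tb) * 100
```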
The questionr package provides the freq() function.
Useful arguments: valid = FALSE to hide the "valid %" column (percentages computed excluding NA values); total = TRUE to add a row with the total; and sort = "dec" (or "inc") to display values in decreasing (increasing) order.
## n %
## Employe 594 29.7
## Ouvrier qualifie 292 14.6
## Cadre 260 13.0
## Ouvrier specialise 203 10.2
## Profession intermediaire 160 8.0
## Technicien 86 4.3
## Autre 58 2.9
## NA 347 17.3
## Total 2000 100.0
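A sketch of a freq() call matching the output above (argument values assumed):

```r
library(questionr)
data(hdv2003)

# n and % per category, total row added, sorted in decreasing order
freq(hdv2003$qualif, valid = FALSE, total = TRUE, sort = "dec")
```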
💻 Practice: heures.tv, trav.imp
A contingency table with the table() function, with two vectors
##
## Homme Femme
## Ouvrier specialise 96 107
## Ouvrier qualifie 229 63
## Technicien 66 20
## Profession intermediaire 88 72
## Cadre 145 115
## Employe 96 498
## Autre 21 37
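A sketch of the two-vector call; the variable names qualif and sexe are assumed from hdv2003:

```r
library(questionr)
data(hdv2003)

# Rows: profession; columns: gender
tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe)
tb_prof_genre
```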
The prop.table() function computes the percentages.
Use the margin argument to indicate the axis (row = 1, column = 2) along which the percentages are computed.
##
## Homme Femme
## Ouvrier specialise 47.29064 52.70936
## Ouvrier qualifie 78.42466 21.57534
## Technicien 76.74419 23.25581
## Profession intermediaire 55.00000 45.00000
## Cadre 55.76923 44.23077
## Employe 16.16162 83.83838
## Autre 36.20690 63.79310
round() to set the number of decimals
##
## Homme Femme
## Ouvrier specialise 47.29 52.71
## Ouvrier qualifie 78.42 21.58
## Technicien 76.74 23.26
## Profession intermediaire 55.00 45.00
## Cadre 55.77 44.23
## Employe 16.16 83.84
## Autre 36.21 63.79
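A sketch combining prop.table() and round(), using the tb_prof_genre table built above:

```r
library(questionr)
data(hdv2003)

tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe)
# Row percentages (margin = 1), rounded to 2 decimals
round(prop.table(tb_prof_genre, margin = 1) * 100, 2)
```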
If we change the axis, the interpretation is different: below, the distribution of professions within each gender (each column sums to 100%).
##
## Homme Femme
## Ouvrier specialise 12.96 11.73
## Ouvrier qualifie 30.90 6.91
## Technicien 8.91 2.19
## Profession intermediaire 11.88 7.89
## Cadre 19.57 12.61
## Employe 12.96 54.61
## Autre 2.83 4.06
questionr provides two functions, lprop() (row percentages) and cprop() (column percentages), with one decimal by default
##
## Homme Femme Total
## Ouvrier specialise 47.3 52.7 100.0
## Ouvrier qualifie 78.4 21.6 100.0
## Technicien 76.7 23.3 100.0
## Profession intermediaire 55.0 45.0 100.0
## Cadre 55.8 44.2 100.0
## Employe 16.2 83.8 100.0
## Autre 36.2 63.8 100.0
## Ensemble 44.8 55.2 100.0
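A sketch of the lprop() call producing row percentages with totals:

```r
library(questionr)
data(hdv2003)

tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe)
# Row percentages, plus a Total column and an Ensemble row
lprop(tb_prof_genre)
```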
Some useful arguments: add "%", add the counts (n) and change the number of decimals
##
## Homme Femme Total n
## Ouvrier specialise 47.29% 52.71% 100.00% 203
## Ouvrier qualifie 78.42% 21.58% 100.00% 292
## Technicien 76.74% 23.26% 100.00% 86
## Profession intermediaire 55.00% 45.00% 100.00% 160
## Cadre 55.77% 44.23% 100.00% 260
## Employe 16.16% 83.84% 100.00% 594
## Autre 36.21% 63.79% 100.00% 58
## Ensemble 44.83% 55.17% 100.00% 1653
Statistical test
##
## Pearson's Chi-squared test
##
## data: tb_prof_genre
## X-squared = 387.56, df = 6, p-value < 2.2e-16
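The test above can be reproduced with chisq.test() on the contingency table (H0: the two variables are independent):

```r
library(questionr)
data(hdv2003)

tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe)
# Pearson's chi-squared test of independence
chisq.test(tb_prof_genre)
```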
Is the distribution of the numerical (quantitative) variable significantly different across the various levels of the categorical (qualitative) variable?
The mean of the quantitative variable for each category of the qualitative variable
## # A tibble: 2 × 2
## sport `mean(age)`
## <fct> <dbl>
## 1 Non 52.3
## 2 Oui 40.9
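A sketch of the grouped mean, assuming the sport and age variables of hdv2003:

```r
library(dplyr)
library(questionr)
data(hdv2003)

# Average age for each level of sport (Oui / Non)
hdv2003 |>
  group_by(sport) |>
  summarise(mean(age))
```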
Statistical test
Test whether the average age is different depending on whether the individual practices sport or not.
If the quantitative variable follows a normal distribution, use the parametric Student t-test; otherwise, use the non-parametric Mann-Whitney test.
If the qualitative variable has more than 2 levels, use the ANOVA test (parametric, aov()) or the Kruskal-Wallis test (non-parametric, kruskal.test()).
To test the normality of the distribution, we can use the Shapiro-Wilk test (shapiro.test()). \(H_0\): the vector follows a normal distribution
##
## Shapiro-Wilk normality test
##
## data: hdv2003$age
## W = 0.98081, p-value = 9.351e-16
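A sketch of the normality test shown above:

```r
library(questionr)
data(hdv2003)

# H0: age is normally distributed
shapiro.test(hdv2003$age)
```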
The p-value is lower than 5%, so we reject \(H_0\): the non-parametric Mann-Whitney test must be used.
Mann-Whitney test: wilcox.test() with paired = FALSE (the default), because the samples are drawn from different populations (athletes and non-athletes are not the same individuals).
##
## Wilcoxon rank sum test with continuity correction
##
## data: age by sport
## W = 640577, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
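A sketch of the Mann-Whitney (Wilcoxon rank sum) test; the samples are independent, which is the default behaviour (paired = FALSE):

```r
library(questionr)
data(hdv2003)

# Compare the age distribution between athletes and non-athletes
wilcox.test(age ~ sport, data = hdv2003)
```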
Student test (\(H_0\): the average age is not significantly different between athletes and non-athletes)
##
## Welch Two Sample t-test
##
## data: age by sport
## t = 15.503, df = 1600.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non and group Oui is not equal to 0
## 95 percent confidence interval:
## 9.893117 12.759002
## sample estimates:
## mean in group Non mean in group Oui
## 52.25137 40.92531
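A sketch of the t-test shown above (Welch's variant, R's default for two samples):

```r
library(questionr)
data(hdv2003)

# H0: mean age is the same for athletes and non-athletes
t.test(age ~ sport, data = hdv2003)
```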
In both cases the p-value is lower than 5%, which means that the age of athletes is significantly different (here, lower) from the age of non-athletes.
Load a new dataset, provided by questionr.
These are data on French municipalities with more than 2000 inhabitants; see ?rp2018 for a description.
## [1] "code_insee" "commune" "code_region" "region"
## [5] "code_departement" "departement" "pop_tot" "pop_cl"
## [9] "pop_0_14" "pop_15_29" "pop_18_24" "pop_75p"
## [13] "pop_femmes" "pop_act_15p" "pop_chom" "pop_agric"
## [17] "pop_indep" "pop_cadres" "pop_interm" "pop_empl"
## [21] "pop_ouvr" "pop_scol_18_24" "pop_non_scol_15p" "pop_dipl_aucun"
## [25] "pop_dipl_bepc" "pop_dipl_capbep" "pop_dipl_bac" "pop_dipl_sup2"
## [29] "pop_dipl_sup34" "pop_dipl_sup" "log_rp" "log_proprio"
## [33] "log_loc" "log_hlm" "log_sec" "log_maison"
## [37] "log_appart" "age_0_14" "age_15_29" "age_75p"
## [41] "femmes" "chom" "agric" "indep"
## [45] "cadres" "interm" "empl" "ouvr"
## [49] "etud" "dipl_aucun" "dipl_bepc" "dipl_capbep"
## [53] "dipl_bac" "dipl_sup2" "dipl_sup34" "dipl_sup"
## [57] "resid_sec" "proprio" "locataire" "hlm"
## [61] "maison" "appart"
Correlation coefficient
The coefficient varies between -1 and +1, it measures the strength and the direction of the relation between two quantitative variables. A coefficient close to -1 indicates a strong negative relation (when one variable increases the other decreases), and a coefficient close to 1 a positive relation (both variables go in the same direction).
## [1] 0.9291504
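A sketch of the call producing the coefficient above, assuming rp2018 from questionr:

```r
library(questionr)
data(rp2018)

# Pearson correlation between the shares of executives and of graduates
cor(rp2018$cadres, rp2018$dipl_sup)
```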
The statistical test can be called with the cor.test()
function
\(H_0\) : the coefficient is equal to 0
##
## Pearson's product-moment correlation
##
## data: rp2018$cadres and rp2018$dipl_sup
## t = 184.94, df = 5415, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9254182 0.9327024
## sample estimates:
## cor
## 0.9291504
we reject \(H_0\), the correlation between the two variables is significantly different from zero
💡 If the relation is not linear the Spearman correlation
test (based on ranks) is more suitable :
cor.test(rp2018$cadres, rp2018$dipl_sup, method="spearman")
💻 Practice
With the dataset rp2018
With ggplot2 package
We use a subset of the dataset rp2018
rp <- rp2018 |> filter(departement %in% c("Oise", "Rhône", "Hauts-de-Seine", "Lozère", "Bouches-du-Rhône"))
head(rp)
## # A tibble: 6 × 62
## code_insee commune code_region region code_departement departement pop_tot
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 13001 Aix-en-Pro… 93 Prove… 13 Bouches-du… 143097
## 2 13002 Allauch 93 Prove… 13 Bouches-du… 20860
## 3 13003 Alleins 93 Prove… 13 Bouches-du… 2516
## 4 13004 Arles 93 Prove… 13 Bouches-du… 51031
## 5 13005 Aubagne 93 Prove… 13 Bouches-du… 47208
## 6 13007 Auriol 93 Prove… 13 Bouches-du… 12771
## # ℹ 55 more variables: pop_cl <fct>, pop_0_14 <dbl>, pop_15_29 <dbl>,
## # pop_18_24 <dbl>, pop_75p <dbl>, pop_femmes <dbl>, pop_act_15p <dbl>,
## # pop_chom <dbl>, pop_agric <dbl>, pop_indep <dbl>, pop_cadres <dbl>,
## # pop_interm <dbl>, pop_empl <dbl>, pop_ouvr <dbl>, pop_scol_18_24 <dbl>,
## # pop_non_scol_15p <dbl>, pop_dipl_aucun <dbl>, pop_dipl_bepc <dbl>,
## # pop_dipl_capbep <dbl>, pop_dipl_bac <dbl>, pop_dipl_sup2 <dbl>,
## # pop_dipl_sup34 <dbl>, pop_dipl_sup <dbl>, log_rp <dbl>, …
Basic principles of ggplot2
- a graph is initialized with ggplot()
- layers are added with geom_ functions (histogram, scatter plot, boxplot, bars, curves, …)
- geom_ functions are combined with aes()
- the aes() function maps dataset variables to graphical elements
geom_histogram to display the distribution of a quantitative variable
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
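A sketch of the histogram call, rebuilding the rp subset used in this section:

```r
library(ggplot2)
library(questionr)
data(rp2018)

rp <- subset(rp2018, departement %in% c("Oise", "Rhône", "Hauts-de-Seine",
                                        "Lozère", "Bouches-du-Rhône"))
# Distribution of the share of executives (30 bins by default)
p <- ggplot(rp, aes(x = cadres)) +
  geom_histogram()
p
```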
3 options to control the binning of data: bins (the number of bins, 30 by default), binwidth (the width of each bin) and breaks (explicit bin boundaries)
geom_density: to display the density of
a quantitative variable
Both histogram and density plot can be displayed on the same
graph
aes(y=after_stat(density)) to display the density on the
y-axis
ggplot(data=rp, aes(x=cadres)) +
geom_histogram(aes(y=after_stat(density)), breaks=seq(0, 60, 10)) +
geom_density()
geom_boxplot: another graph that can be used to visualize a quantitative variable.
One boxplot for each modality (level) of a qualitative variable
varwidth=T to have the width of the boxes proportional
to the number of observations in each group
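A sketch of the boxplot with varwidth, assuming the rp subset defined earlier in this section:

```r
library(ggplot2)
library(questionr)
data(rp2018)

rp <- subset(rp2018, departement %in% c("Oise", "Rhône", "Hauts-de-Seine",
                                        "Lozère", "Bouches-du-Rhône"))
# One box per departement; box width proportional to the group size
p <- ggplot(rp, aes(x = departement, y = cadres)) +
  geom_boxplot(varwidth = TRUE)
p
```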
geom_bar: to display the effectives of
each category of a qualitative variable
To display the proportions, we need to use
after_stat(prop) with group=1 to calculate the
proportion for the whole dataset
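A sketch of a proportion bar chart with after_stat(prop); group = 1 makes the proportions relative to the whole dataset:

```r
library(ggplot2)
library(questionr)
data(rp2018)

rp <- subset(rp2018, departement %in% c("Oise", "Rhône", "Hauts-de-Seine",
                                        "Lozère", "Bouches-du-Rhône"))
# Bar heights are shares of all observations rather than raw counts
p <- ggplot(rp, aes(x = departement)) +
  geom_bar(aes(y = after_stat(prop), group = 1))
p
```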
Another solution is to use geom_col to display
calculated values
rp |>
group_by(departement) |>
summarise(n=n()) |>
mutate(percent=n/sum(n)*100) |>
ggplot() +
geom_col(aes(x=departement, y=percent))
The slight difference between geom_bar(after_stat) and
geom_col with summarise lies in when proportions are
computed:
geom_bar(): automatically counts occurrences and
applies after_stat(prop) at the plotting stage. Proportions are computed
after binning, which may introduce slight rounding differences.
summarise() + geom_col(): counts (n = n()) and
proportions (n / sum(n) * 100) are pre-computed before plotting. This
guarantees a direct control over displayed percentages.
💡 Use geom_bar() for a quick approach without
preprocessing and geom_col() for full control over
calculations and additional statistics.
geom_col with polar coordinates
ggplot2 does not provide a geom_pie function, but it is possible to create a pie chart with geom_col and polar coordinates
rp |>
group_by(departement) |>
summarise(n=n()) |>
mutate(percent=n/sum(n)*100) |>
ggplot() +
geom_col(aes(x="", y=percent, fill=departement)) +
coord_polar("y", start=0)
geom_point
to add the regression line in the graph, use the geom_smooth function
## `geom_smooth()` using formula = 'y ~ x'
💡 Since the aes() mapping would otherwise have to be specified twice (once per geom), it can be given once in the ggplot() call and inherited by both geoms
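A sketch with the shared aes() given once in the ggplot() call, so that geom_point() and geom_smooth() both inherit it:

```r
library(ggplot2)
library(questionr)
data(rp2018)

rp <- subset(rp2018, departement %in% c("Oise", "Rhône", "Hauts-de-Seine",
                                        "Lozère", "Bouches-du-Rhône"))
# Scatter plot plus a linear regression line
p <- ggplot(rp, aes(x = dipl_sup, y = cadres)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```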
geom_line
Load dataset ‘economics’ from the package ‘ggplot2’
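A minimal line-chart sketch with the economics dataset (the plotted variable is assumed):

```r
library(ggplot2)
data(economics, package = "ggplot2")

# Number of unemployed over time
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()
p
```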
A fourth dimension through the size of the points which represents the total population of the commune
Possible to customize a mapping with scale_
ggplot(rp) +
geom_point(aes(x=dipl_sup, y=cadres, color=departement, size=pop_tot)) +
scale_size("Population", breaks=c(0, 5000, 10000, 50000, 100000), range=c(0, 10))
Repeat a graph several times according to one or several variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
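A sketch of faceting with facet_wrap(), one histogram panel per departement:

```r
library(ggplot2)
library(questionr)
data(rp2018)

rp <- subset(rp2018, departement %in% c("Oise", "Rhône", "Hauts-de-Seine",
                                        "Lozère", "Bouches-du-Rhône"))
# The same histogram repeated for each departement
p <- ggplot(rp, aes(x = cadres)) +
  geom_histogram() +
  facet_wrap(~departement)
p
```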
labs function to add titles and labels
ggplot(rp) +
geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) +
labs(title="Relation between Executives and Graduates", x="Graduates", y="Executives")
scale_x_continuous, scale_y_continuous to
customize the axes
ggplot(rp) +
geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) +
scale_x_continuous("Graduates", breaks=seq(0, 60, 10), limits=c(0, 60)) +
scale_y_continuous("Executives", breaks=seq(0, 60, 10), limits=c(0, 60))
scale_x_discrete, scale_y_discrete for
qualitative variables
ggplot(rp) +
geom_bar(aes(x=departement)) +
scale_x_discrete("Departement") +
scale_y_continuous("Number of observations")
With labs
Practice
With the dataset rp2018
Linear regression is a technique used for modelling the relationship between a dependent variable and one or more independent variables.
Use the lm(formula, data) function, where formula specifies the relationship between the variables, and data is the dataset.
The formula is written y ~ x1 + x2 + ... + xn, where y is the dependent variable and x1, x2, ..., xn are the independent variables.
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 3, 2, 5, 4)
y <- c(2, 3, 5, 7, 11)
lm_mod <- lm(y ~ x1 + x2)
summary(lm_mod)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## 1 2 3 4 5
## 0.8 -0.3 -0.9 -0.5 0.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6500 1.4921 -0.436 0.7056
## x1 2.3500 0.5256 4.471 0.0466 *
## x2 -0.2500 0.6374 -0.392 0.7327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.14 on 2 degrees of freedom
## Multiple R-squared: 0.9492, Adjusted R-squared: 0.8984
## F-statistic: 18.69 on 2 and 2 DF, p-value: 0.05078
Logistic regression uses the glm() function, which stands for "generalized linear model".
The call is glm(formula, data, family), where family specifies the error distribution and link function to use. For logistic regression, we use family = binomial.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
glm_mod <- glm(y ~ x, family=binomial)
summary(glm_mod)
##
## Call:
## glm(formula = y ~ x, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.159 4.759 -1.504 0.133
## x 1.302 0.840 1.550 0.121
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13.863 on 9 degrees of freedom
## Residual deviance: 5.018 on 8 degrees of freedom
## AIC: 9.018
##
## Number of Fisher Scoring iterations: 6
ℹ️ The coefficients are on the log-odds scale: a one unit increase in the independent variable changes the log-odds of the outcome by \(\beta\). Exponentiating gives an odds ratio: the odds of the outcome are multiplied by \(e^\beta\). If \(\beta = 1.3\), then \(e^\beta \approx 3.67\), so for each additional unit of \(x\), the odds of \(y=1\) are multiplied by 3.67.
To convert odds to a probability: Probability = Odds / (1 + Odds). Odds of 3.67 correspond to a probability of 3.67 / (1 + 3.67) ≈ 0.786, or 78.6%. Note that this is the probability implied by given odds, not a per-unit increase in probability: unlike odds ratios, changes in probability are not constant across values of \(x\).
Multinomial logistic regression uses the multinom() function from the nnet package.
The coefficients in the output can be interpreted similarly to binary logistic regression, but each coefficient now corresponds to the log of the odds of the outcome being in the corresponding category versus the reference category.
ℹ️ Exponentiating the coefficient transforms it from log-odds back into plain odds. This value tells you how the odds of being in the specified category (compared to the reference category) change with a one-unit increase in the independent variable. A value above 1 indicates an increase in odds, a value below 1 indicates a decrease, and a value of 1 means no change. Example : Suppose you have a multinomial logistic regression model predicting the preferred type of pet (cats, dogs, or birds, with “birds” as the reference category) based on an independent variable like age. If the coefficient for “cats” is 0.5, then the log-odds of preferring cats over birds increase by 0.5 for each additional year of age. Exponentiating 0.5 gives you the odds ratio (about 1.65), indicating that with each additional year of age, the odds of preferring cats over birds increase by 65%.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
multi_mod <- multinom(y ~ x)
## # weights: 9 (4 variable)
## initial value 10.986123
## iter 10 value 0.934296
## iter 20 value 0.544624
## iter 30 value 0.329800
## iter 40 value 0.222875
## iter 50 value 0.142208
## iter 60 value 0.104801
## iter 70 value 0.066253
## iter 80 value 0.048903
## iter 90 value 0.046977
## iter 100 value 0.045428
## final value 0.045428
## stopped after 100 iterations
## Call:
## multinom(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 2 -36.19612 10.27190
## 3 -89.31736 18.44352
##
## Std. Errors:
## (Intercept) x
## 2 68.70837 18.81918
## 3 99.69833 21.83700
##
## Residual Deviance: 0.0908552
## AIC: 8.090855
Panel data models use the plm package.
The call is plm(formula, data, index, model), where index is a vector giving the names or indices of the individual and time indexes, and model specifies the type of model: within, random or pooling.
data("Produc", package = "plm")
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"), model = "within")
summary(model)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
## data = Produc, model = "within", index = c("state", "year"))
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.120456 -0.023741 -0.002041 0.018144 0.174718
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## log(pcap) -0.02614965 0.02900158 -0.9017 0.3675
## log(pc) 0.29200693 0.02511967 11.6246 < 2.2e-16 ***
## log(emp) 0.76815947 0.03009174 25.5273 < 2.2e-16 ***
## unemp -0.00529774 0.00098873 -5.3582 1.114e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 18.941
## Residual Sum of Squares: 1.1112
## R-Squared: 0.94134
## Adj. R-Squared: 0.93742
## F-statistic: 3064.81 on 4 and 764 DF, p-value: < 2.22e-16
Note: use ?Produc to get information about the dataset
Mixed-effects models use the lme4 package; once loaded, the lmer function fits linear mixed-effects models (generalized linear mixed models use glmer).
The call is lmer(response ~ predictors + (1|group), data), where the (1|group) term specifies a random intercept for the variable group.
The coefficients for the fixed effects can be interpreted in the usual way: for a one unit increase in the independent variable, the dependent variable will increase by the estimated coefficient, ceteris paribus. The random effects variance components indicate the degree of variability across the levels of the random effects variable(s).
data("sleepstudy", package = "lme4")
mlev_mod <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(mlev_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1786.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2257 -0.5529 0.0109 0.5188 4.2506
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 1378.2 37.12
## Residual 960.5 30.99
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.4051 9.7467 25.79
## Days 10.4673 0.8042 13.02
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.371
Another example, with two random effects: s and d
Dataset InstEval from the lme4 package
data("InstEval", package = "lme4")
model_extended <- lmer(y ~ service + (1 | s) + (1 | d), data = InstEval)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ service + (1 | s) + (1 | d)
## Data: InstEval
##
## REML criterion at convergence: 237743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0564 -0.7480 0.0406 0.7719 3.1969
##
## Random effects:
## Groups Name Variance Std.Dev.
## s (Intercept) 0.1057 0.325
## d (Intercept) 0.2715 0.521
## Residual 1.3866 1.178
## Number of obs: 73421, groups: s, 2972; d, 1128
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.28328 0.01881 174.511
## service1 -0.09113 0.01327 -6.867
##
## Correlation of Fixed Effects:
## (Intr)
## service1 -0.226
Interpretation:
Analysis of variance
Principal Components Analysis (PCA)
Steps:
Example: PCA on the Iris dataset.
Can be performed with the stats package (already
installed) or with FactoMineR and
factoextra
FactoMineR is used to perform the analysis; factoextra is used to extract information from the analysis and provides visualizations (fviz_ functions).
With stats:
iris_num <- iris[, -5] # Remove the species column
pca_result <- prcomp(iris_num, scale = TRUE, center = TRUE)
center = TRUE -> mean-centering the data; scale = TRUE -> standardizing the data.
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
PC1 + PC2 explain 95.8% of the total variance in the data.
PC3 adds 3.7% more variance, and PC4 contributes almost nothing.
Two PCs are therefore sufficient.
The scree plot is a visual representation of the variance explained by each principal component. It helps to decide how many components to keep.
The elbow method is often used to identify the number of components that
explain most of the variance.
The point where the curve starts to flatten is a good indicator of the
number of components to keep.
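A sketch of a scree plot with base R's screeplot():

```r
iris_num <- iris[, -5]  # drop the Species column
pca_result <- prcomp(iris_num, scale = TRUE, center = TRUE)

# Variance of each component, drawn as a line to spot the elbow
screeplot(pca_result, type = "lines", main = "Scree plot")
```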
## PC1 PC2 PC3 PC4 Species
## 1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa
## 2 -2.074013 0.6718827 0.23382552 0.102662845 setosa
## 3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa
## 4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa
## 6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
pca_result$x contains the coordinates of the original
observations projected onto the principal components (PCs).
Each row represents an observation (data point), and each column
represents a Principal Component (PC1, PC2, etc.).
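The data frame holding these scores (named pca_data in the plotting code) can be assembled like this:

```r
iris_num <- iris[, -5]
pca_result <- prcomp(iris_num, scale = TRUE, center = TRUE)

# One row per observation: its PC coordinates plus the species label
pca_data <- data.frame(pca_result$x, Species = iris$Species)
head(pca_data)
```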
Scatter plot of the first two principal components, colored by species.
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
geom_point(size = 3) +
labs(title = "PCA Projection of Iris Dataset")
Same analysis with FactoMineR and factoextra
##
## Call:
## PCA(X = iris_num, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4
## Variance 2.918 0.914 0.147 0.021
## % of var. 72.962 22.851 3.669 0.518
## Cumulative % of var. 72.962 95.813 99.482 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 2.319 | -2.265 1.172 0.954 | 0.480 0.168 0.043 | -0.128
## 2 | 2.202 | -2.081 0.989 0.893 | -0.674 0.331 0.094 | -0.235
## 3 | 2.389 | -2.364 1.277 0.979 | -0.342 0.085 0.020 | 0.044
## 4 | 2.378 | -2.299 1.208 0.935 | -0.597 0.260 0.063 | 0.091
## 5 | 2.476 | -2.390 1.305 0.932 | 0.647 0.305 0.068 | 0.016
## 6 | 2.555 | -2.076 0.984 0.660 | 1.489 1.617 0.340 | 0.027
## 7 | 2.468 | -2.444 1.364 0.981 | 0.048 0.002 0.000 | 0.335
## 8 | 2.246 | -2.233 1.139 0.988 | 0.223 0.036 0.010 | -0.089
## 9 | 2.592 | -2.335 1.245 0.812 | -1.115 0.907 0.185 | 0.145
## 10 | 2.249 | -2.184 1.090 0.943 | -0.469 0.160 0.043 | -0.254
## ctr cos2
## 1 0.074 0.003 |
## 2 0.250 0.011 |
## 3 0.009 0.000 |
## 4 0.038 0.001 |
## 5 0.001 0.000 |
## 6 0.003 0.000 |
## 7 0.511 0.018 |
## 8 0.036 0.002 |
## 9 0.096 0.003 |
## 10 0.293 0.013 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Sepal.Length | 0.890 27.151 0.792 | 0.361 14.244 0.130 | -0.276 51.778
## Sepal.Width | -0.460 7.255 0.212 | 0.883 85.247 0.779 | 0.094 5.972
## Petal.Length | 0.992 33.688 0.983 | 0.023 0.060 0.001 | 0.054 2.020
## Petal.Width | 0.965 31.906 0.931 | 0.064 0.448 0.004 | 0.243 40.230
## cos2
## Sepal.Length 0.076 |
## Sepal.Width 0.009 |
## Petal.Length 0.003 |
## Petal.Width 0.059 |
screeplot
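A sketch of the FactoMineR/factoextra scree plot; the object name pca_result_2 matches the biplot call used later:

```r
library(FactoMineR)
library(factoextra)

iris_num <- iris[, -5]
pca_result_2 <- PCA(iris_num, graph = FALSE)

# Percentage of variance explained by each dimension
fviz_eig(pca_result_2, addlabels = TRUE)
```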
Biplot: displays both the observations and the
variables on the same plot.
The observations are represented by points, and the variables are
represented by arrows.
fviz_pca_biplot(pca_result_2, label="var", habillage=as.factor(iris$Species), addEllipses = T, repel = TRUE)
K-means clustering is a technique used for partitioning a dataset into K clusters based on their similarity. It is an unsupervised learning method that can be used for exploratory data analysis.
Clusters are built around centroids, which are the mean of the data points in each cluster. The goal is to minimize the sum of squared distances between each data point and the centroid of its cluster, and to maximize the distance between centroids.
We use the iris_num dataset (the species column removed).
As for dimensionality reduction, standardizing the data is important for
k-means clustering, otherwise variables with larger scales might
dominate the clustering process.
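A sketch of a k-means call that can produce output like the one below (seed and nstart values assumed):

```r
set.seed(123)  # assumed seed, for reproducibility
iris_num <- iris[, -5]

# 3 clusters; nstart = 25 runs several random starts and keeps the best
kmeans_result <- kmeans(iris_num, centers = 3, nstart = 25)
kmeans_result
```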
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.006000 3.428000 1.462000 0.246000
## 2 5.901613 2.748387 4.393548 1.433871
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
💡 Notes:
Here we set 3 clusters because we know that the Iris dataset has 3
species.
But in practice, the number of clusters is often unknown and must be
determined using methods like the elbow method, silhouette method, or
gap statistic.
We add the cluster column to the original dataset
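The comparison table shown here can be built with table() (note that cluster labels are arbitrary and may be permuted between runs):

```r
set.seed(123)
iris_num <- iris[, -5]
kmeans_result <- kmeans(iris_num, centers = 3, nstart = 25)

# Cross-tabulate the true species against the assigned cluster
table(iris$Species, kmeans_result$cluster)
```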
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 14 36
Comparison between the clusters and the actual species shows that the clustering algorithm has done a good job of separating the species into different clusters.
Visualizing the Clusters
We can visualize the clusters by plotting the first two principal
components and coloring the points by cluster.
We indeed saw that with PCA, the first two components explain most of
the variance in the data.
Remember:
pca_data <- data.frame(pca_result$x, Species = iris$Species),
where pca_result$x contains the coordinates of the original
observations projected onto the principal components
(PCs).
ggplot(pca_data, aes(x = PC1, y = PC2, color = as.factor(kmeans_result$cluster))) +
geom_point(size = 3) +
labs(title = "K-Means Clustering on PCA Projection", color = "Cluster")
Choose the optimal number of clusters with the Elbow Method
The Elbow Method is a heuristic technique used to determine the optimal number of clusters in a dataset. It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters, and looking for the “elbow” point where the rate of decrease slows down.
We test different numbers of clusters (from 1 to 10) and calculate the WCSS for each.
Then we plot the WCSS against the number of clusters.
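A sketch of the WCSS computation that the plotting code assumes (the wcss vector):

```r
set.seed(123)
iris_num <- iris[, -5]

# Total within-cluster sum of squares for k = 1..10
wcss <- sapply(1:10, function(k) {
  kmeans(iris_num, centers = k, nstart = 25)$tot.withinss
})
wcss
```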
ggplot(data.frame(k = 1:10, wcss), aes(x = k, y = wcss)) +
geom_line() +
geom_point() +
labs(title = "Elbow Method for Optimal Number of Clusters", x = "Number of Clusters", y = "Within-Cluster Sum of Squares")
The "elbow" point is where the rate of decrease slows down. In this case, it's at k = 3, which confirms that 3 clusters is a good choice for the Iris dataset.